-
Notifications
You must be signed in to change notification settings - Fork 246
feat: add background P2P init retries to SyncService #3002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @julienrbrt, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the robustness of the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
|
|
Claude finished @julienrbrt's task in 1m 59s —— View job Code Review CompleteI've completed a comprehensive review of this PR. The changes introduce a background retry mechanism for P2P initialization when the initial synchronous attempt times out. Overall, this is a valuable improvement for resilience, but I've identified several issues that need attention. SummaryWhat Changed:
Impact: This change prevents nodes from getting stuck indefinitely during P2P initialization, improving startup reliability and resilience when peers are temporarily unavailable. Critical Issues1. Race Condition in Store Initialization 🔴
The if syncService.storeInitialized.CompareAndSwap(false, true) {
if _, err := syncService.initStore(ctx, trusted); err != nil {
syncService.storeInitialized.Store(false)
return false, fmt.Errorf("failed to initialize the store: %w", err)
}
}Problem: If the initial sync times out at 2 minutes and background retry starts, but then a call to Impact: Potential data corruption or panics from concurrent store operations. Recommendation: Add a mutex to serialize store initialization or use a more robust synchronization pattern like 2. Goroutine Leak Risk 🔴
The background goroutine is launched with case <-timeoutTimer.C:
syncService.logger.Warn().Err(err).Msg("timeout reached while trying to initialize the store, scheduling background retry")
go syncService.retryInitInBackground()
return nilProblem: If Recommendation:
3. Context Misuse in tryInit 🟡
The background retry uses ok, err := syncService.tryInit(syncService.bgCtx)Problem: If the syncer is started with Recommendation: The syncer should use the service's main operational context, not the background context. Consider passing both contexts or creating a merged context. Major Issues4. No Observability for Background Retry State 🟡
There's no way to observe whether the background retry is running, how many retries have occurred, or when it succeeds/fails. Recommendation:
5. Timeout Value Lacks Configuration 🟡
The 2-minute timeout is hardcoded: timeoutTimer := time.NewTimer(time.Minute * 2)Problem: Different deployments may need different timeout values based on network conditions, DA layer characteristics, etc. Recommendation: Make this configurable via 6. Test Uses Hard Sleep Instead of Proper Synchronization 🟡
// Give the retry a moment to start and fail at least once (no headers available yet)
time.Sleep(100 * time.Millisecond)Problem: This test is flaky and time-dependent. On slower CI systems, 100ms might not be enough; on faster systems, it wastes time. Recommendation: Use proper synchronization (channels, mock callbacks, or monitoring atomic flags) instead of sleeps. Minor Issues7. Inconsistent Error Logging 🔵
The retry loop logs errors at syncService.logger.Info().Err(err).Dur("retry_in", backoff).Msg("headers not yet available from peers, waiting to initialize header sync")
syncService.logger.Info().Err(err).Dur("retry_in", backoff).Msg("background retry: headers not yet available from peers")Recommendation: Use 8. Missing Documentation for Background Context Fields 🔵
// context for background operations
bgCtx context.Context
bgCancel context.CancelFuncRecommendation: Expand the comment to explain:
9. Magic Numbers in Backoff Configuration 🔵
backoff := 15 * time.Second
maxBackoff := 5 * time.MinuteRecommendation: Extract these as named constants with comments explaining the rationale for these values. Positive Aspects ✅
Code Quality Assessment
RecommendationsMust Fix Before Merge:
Should Fix Before Merge: Nice to Have: Overall AssessmentThis PR addresses an important operational issue (nodes getting stuck during P2P init), and the approach is sound. However, there are critical concurrency issues that must be addressed before merging. The race condition in store initialization (#1) could lead to data corruption, and the goroutine leak risk (#2) could cause resource exhaustion in long-running deployments. After addressing the critical issues, this will be a solid improvement to the codebase's resilience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a background retry mechanism for P2P initialization in SyncService. When the initial synchronous initialization times out, it now schedules a background task to continue retrying. This is a good improvement for startup performance as it avoids blocking indefinitely.
The changes are well-structured, including the addition of a background context for lifecycle management and the extraction of initialization logic into a reusable tryInit method. However, I've found a critical issue in the new retryInitInBackground function. It appears to be incomplete, missing the exponential backoff logic and the closing brace for the for loop, which would cause a compilation error. Please see my detailed comment for a suggested fix.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3002 +/- ##
==========================================
+ Coverage 58.58% 58.75% +0.17%
==========================================
Files 110 110
Lines 10396 10419 +23
==========================================
+ Hits 6090 6122 +32
+ Misses 3662 3651 -11
- Partials 644 646 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
3ccfda2 to
8baaa65
Compare
Closes: #2994